On the Robustness of Authorship Attribution Based on Character N-gram Features
Abstract
A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, it is doubtful whether such a simple, low-level representation is equally effective in realistic conditions where some of these factors cannot be kept stable. In this study, the robustness of authorship attribution based on character n-gram features is tested under cross-genre and cross-topic conditions. In addition, the distribution of texts over the candidate authors is varied between the training and test corpora to imitate real cases. Comparative results with another competitive text representation approach based on very frequent words show that character n-grams are better able to capture stylistic properties of text when there are significant differences between the training and test corpora. Moreover, a set of guidelines for tuning an authorship attribution model according to the properties of the training and test corpora is provided.
* Assistant Professor, University of the Aegean.
Journal of Law and Policy
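The character n-gram representation discussed in the abstract can be illustrated with a minimal sketch. This is not the study's pipeline: the choice of n = 3, the function names `char_ngrams` and `profile`, and the toy texts are all assumptions made for illustration. The idea shown is only that a text is mapped to the relative frequencies of the most frequent character n-grams found in some training text.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count every overlapping character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def profile(text: str, vocabulary: list[str], n: int = 3) -> list[float]:
    """Relative frequency in `text` of each n-gram listed in `vocabulary`."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1  # guard against texts shorter than n
    return [counts[g] / total for g in vocabulary]


# Toy texts, invented for illustration only (hypothetical data).
train_text = "the cat sat on the mat"
vocabulary = [g for g, _ in char_ngrams(train_text).most_common(5)]

print(vocabulary)
print(profile("the hat", vocabulary))
```

A vector like the one returned by `profile` could then be passed to any standard classifier. A representation based on very frequent words, the comparison approach mentioned in the abstract, would follow the same pattern with word tokens in place of character n-grams.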
Similar resources
Authorship Attribution in Bengali Language
We describe authorship attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-to-end system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and a learning curve to assess the relationship between amount of training data and test accu...
Maximal Repeats Enhance Substring-based Authorship Attribution
This article tackles the authorship attribution task with respect to the issue of language independence. We propose an alternative to variable-length character n-gram features in supervised methods: maximal repeats in strings. Whereas character n-grams are inherently redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundant aspect of n...
The effect of author set size and data size in authorship attribution
Applications of authorship attribution ‘in the wild’ [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010: 10.1007/s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the r...
Author Identification Using Different Sizes of Documents: A Summary
In the present research work, we deal with the problem of authorship attribution of ancient Arabic text documents, which were written by several ancient philosophers. For that purpose, we conducted several authorship attribution experiments applied with different text sizes. A special dataset, called “A4P” (Authorship Attribution for Ancient Arabic Philosophers), has been constructed by extract...
Not All Character N-grams Are Created Equal: A Study in Authorship Attribution
Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate t...
Publication date: 2013